HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization
Petnehazi, Gabor, Aradi, Bernadett
The explosive growth of complex datasets across various modalities necessitates advanced analytical tools that not only group data effectively but also provide human-understandable insights into the discovered structures. We introduce HERCULES (Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization), a novel algorithm and Python package designed for hierarchical k-means clustering of diverse data types, including text, images, and numeric data (processed one modality per run). HERCULES constructs a cluster hierarchy by recursively applying k-means clustering, starting from individual data points at level 0. A key innovation is its deep integration of Large Language Models (LLMs) to generate semantically rich titles and descriptions for clusters at each level of the hierarchy, significantly enhancing interpretability. The algorithm supports two main representation modes: 'direct' mode, which clusters based on original data embeddings or scaled numeric features, and 'description' mode, which clusters based on embeddings derived from LLM-generated summaries. Users can provide a 'topic_seed' to guide LLM-generated summaries towards specific themes. An interactive visualization tool facilitates thorough analysis and understanding of the clustering results. We demonstrate HERCULES's capabilities and discuss its potential for extracting meaningful, hierarchical knowledge from complex datasets.
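HERCULES itself works level by level from individual points and layers LLM-generated titles and descriptions on top, which is beyond a short snippet. As a rough illustration of the recursive-k-means idea only, here is a toy top-down variant on 1-D numeric data (all names hypothetical, not the package's API):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Plain 1-D k-means; returns the non-empty clusters as lists."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # recompute each center as its cluster mean (keep old center if empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return [c for c in clusters if c]

def build_hierarchy(points, k=2, min_size=3, level=0):
    """Recursively split a cluster with k-means until it is small."""
    node = {"level": level, "points": sorted(points), "children": []}
    if len(points) > min_size:
        for cluster in kmeans_1d(points, min(k, len(points))):
            if len(cluster) < len(points):   # avoid degenerate non-splits
                node["children"].append(
                    build_hierarchy(cluster, k, min_size, level + 1))
    return node
```

In the real algorithm, each node of such a tree would additionally carry an LLM-generated title and description, and in 'description' mode the next level would cluster embeddings of those summaries rather than the raw data.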
multivariateGPT: a decoder-only transformer for multivariate categorical and numeric data
Loza, Andrew J., Kim, Jun Yup, Song, Shangzheng, Liu, Yihang, Sung, Joseph J. Y., Taylor, R Andrew, Shung, Dennis L.
Real-world processes often generate data that are a mix of categorical and numeric values that are recorded at irregular and informative intervals. Discrete token-based approaches are limited in numeric representation capacity while methods like neural ordinary differential equations are not well suited for categorical data or informative sampling and require augmentation to handle certain classes of trajectories. Here, we present multivariateGPT, a single architecture for modeling sequences of mixed categorical (including tokenized text) and numeric data. This is accomplished with an autoregressive sequence decomposition, embedding scheme, and loss function that extend the next token prediction task to likelihood estimation of the joint distribution of next token class and value. We demonstrate how this approach can efficiently learn to generalize patterns in simple physical systems and model complex time series including electrocardiograms and multivariate electronic health record data. This work extends the utility of transformer based models to additional classes of data.
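The joint objective described above can be pictured as a categorical term for the next token's class plus a conditional density term for its value when the class is numeric. A minimal sketch of such a decomposed negative log-likelihood, assuming an illustrative Gaussian parameterization for the numeric part (not necessarily the paper's exact formulation):

```python
import math

def joint_nll(class_logits, true_class, value=None, mu=0.0, sigma=1.0):
    """NLL of (class, value): the categorical term always applies; a
    Gaussian term is added only when the class carries a numeric value."""
    # -log softmax(class_logits)[true_class], computed stably
    m = max(class_logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in class_logits))
    nll = log_z - class_logits[true_class]
    if value is not None:  # -log N(value | mu, sigma^2)
        nll += 0.5 * math.log(2 * math.pi * sigma ** 2) \
               + (value - mu) ** 2 / (2 * sigma ** 2)
    return nll
```

Training then minimizes this quantity summed over the sequence, so purely categorical steps and mixed class-plus-value steps share one autoregressive loss.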
ESG Rating Disagreement and Corporate Total Factor Productivity: Inference and Prediction
Zhanli Li
Highlights: ESG rating disagreement can lead to a decline in corporate total factor productivity. When faced with ESG rating disagreement, reactive green innovation by enterprises does not lead to improvements in total factor productivity.
This paper explores the relationship between ESG rating disagreement and total factor productivity (TFP) based on data from Chinese domestic ESG rating agencies and the financial data of A-share listed companies in China from 2015 to 2022. The empirical results show that ESG rating disagreement reduces corporate TFP, a conclusion validated through multiple robustness tests. The mechanism analysis reveals an interaction effect between green innovation and ESG rating disagreement: in firms without ESG rating disagreement, green innovation promotes the improvement of TFP; in firms with disagreement, although ESG rating disagreement may drive green innovation, this does not lead to an increase in TFP. The heterogeneity analysis indicates that this effect is more pronounced in non-state-owned, asset-intensive, and low-pollution enterprises.
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > China > Chongqing Province > Chongqing (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Banking & Finance > Trading (0.93)
- Government > Regional Government > Asia Government > China Government (0.67)
Self-Supervised Predictive Coding with Multimodal Fusion for Patient Deterioration Prediction in Fine-grained Time Resolution
Lee, Kwanhyung, Won, John, Hyun, Heejung, Hahn, Sangchul, Choi, Edward, Lee, Joohyung
Accurate time prediction of patients' critical events is crucial in urgent scenarios where timely decision-making is important. Though many studies have proposed automatic prediction methods using Electronic Health Records (EHR), their coarse-grained time resolutions limit their practical usage in urgent environments such as the emergency department (ED) and intensive care unit (ICU). Therefore, in this study, we propose an hourly prediction method based on self-supervised predictive coding and multi-modal fusion for two critical tasks: mortality and vasopressor need prediction. Through extensive experiments, we demonstrate significant performance gains from both multi-modal fusion and self-supervised predictive regularization, most notably in far-future prediction, which becomes especially important in practice. Our uni-modal, bi-modal, and bi-modal-with-self-supervision configurations scored 0.846/0.877/0.897, respectively.
Streaming Encoding Algorithms for Scalable Hyperdimensional Computing
Thomas, Anthony, Khaleghi, Behnam, Jha, Gopi Krishna, Dasgupta, Sanjoy, Himayat, Nageen, Iyer, Ravi, Jain, Nilesh, Rosing, Tajana
Hyperdimensional computing (HDC) is a paradigm for data representation and learning originating in computational neuroscience. HDC represents data as high-dimensional, low-precision vectors which can be used for a variety of information processing tasks like learning or recall. The mapping to high-dimensional space is a fundamental problem in HDC, and existing methods encounter scalability issues when the input data itself is high-dimensional. In this work, we explore a family of streaming encoding techniques based on hashing. We show formally that these methods enjoy comparable guarantees on performance for learning applications while being substantially more efficient than existing alternatives. We validate these results experimentally on a popular high-dimensional classification problem and show that our approach easily scales to very large data sets.
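As a rough sketch of what a hashing-based streaming encoder can look like, the following count-sketch-style map sends each input coordinate to one output slot and sign via a hash, so the encoder needs no stored codebook and can consume features one at a time. This is illustrative only; the paper's encoders and their formal guarantees differ in detail, and all names here are hypothetical:

```python
import hashlib

def hdc_encode(features, dim=64):
    """Stream (index, value) pairs into a dim-dimensional vector.

    Each input coordinate is hashed to a single output slot and a
    sign, so memory is O(dim) regardless of the input dimension.
    """
    v = [0.0] * dim
    for idx, val in features:
        # deterministic 64-bit hash of the feature index
        h = int.from_bytes(
            hashlib.blake2b(str(idx).encode(), digest_size=8).digest(),
            "big")
        slot = h % dim                        # which coordinate to update
        sign = 1.0 if (h >> 32) & 1 else -1.0  # pseudo-random +/- sign
        v[slot] += sign * val
    return v
```

Because each feature touches exactly one slot, encoding a stream of n features costs O(n) hash evaluations and O(dim) memory, which is what makes this family attractive when the input itself is high-dimensional.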
- Africa > Chad > Salamat (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > North Carolina (0.04)
- Health & Medicine > Therapeutic Area > Neurology (0.54)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Modern Data Scientist
Modern Data Science with R is a comprehensive data science textbook for undergraduates that incorporates statistical and computational thinking to solve real-world problems with data. These days, you can find thousands of job openings for the position of Data Scientist. AI and machine learning have surely taken the world by storm. Though AI was invented several decades ago, we have started seeing its practical usefulness only in the last few years. Its applications range from simple ones, such as predicting house prices, to sophisticated ones such as real-time person detection in surveillance video and real-time traffic monitoring, as well as text-based tasks such as rating hotels from past customer reviews, topic modeling, and real-time language translation.
How to Prepare Your Data - KDnuggets
It is rare that you get data in exactly the form you need it. Often you'll need to create some new variables, rename existing ones, reorder the observations, or just drop records in order to make the data a little easier to work with. This is called data wrangling (or preparation), and it is a key part of data science. Most of the time, the data you have can't be used straight away for your analysis: it will usually require some manipulation and adaptation, especially if you need to aggregate other sources of data into the analysis. In essence, raw data is messy (usually unusable at the start), and you'll need to roll up your sleeves to get it into the right shape.
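The operations listed above (renaming variables, deriving new ones, dropping others, reordering observations) can be sketched in plain Python on a small hypothetical table, here a list of dicts with made-up column names:

```python
# a tiny made-up raw table
rows = [
    {"Name": "a.csv", "size_kb": 10, "tmp": 1},
    {"Name": "b.csv", "size_kb": 2500, "tmp": 0},
]

clean = [
    {"name": r["Name"],                  # rename: Name -> name
     "size_mb": r["size_kb"] / 1024}     # derive: size_mb from size_kb
    for r in rows                        # drop: tmp is simply not carried over
]
clean.sort(key=lambda r: r["size_mb"])   # reorder the observations
```

In practice you would usually reach for a library such as pandas or dplyr, but the wrangling steps themselves are the same.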
End-to-End Machine Learning Course 2 Tensors
A tensor is a multi-dimensional array; in deep learning, the term usually refers to arrays of dimension 3 or higher. A scalar is a single number, such as 1; a vector (also known as a list or array) looks like [1, 2, 3]; a two-by-two matrix looks like [[1, 2], [3, 4]]; and a three-dimensional tensor looks like [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]. A vector contains a bunch of scalars, a matrix contains a bunch of vectors, and a tensor contains a bunch of matrices. You can check the data type of a variable in Python using type(variable_name); for a PyTorch tensor, this returns torch.Tensor.
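The nesting described above can be made concrete with plain Python lists as a stand-in for PyTorch tensors (`rank` here is a hypothetical helper that counts nesting depth, analogous to a tensor's number of dimensions):

```python
def rank(x):
    """Nesting depth: 0 for a scalar, 1 for a vector, 2 for a matrix, ..."""
    return 1 + rank(x[0]) if isinstance(x, list) else 0

scalar = 1
vector = [1, 2, 3]                   # a bunch of scalars
matrix = [[1, 2], [3, 4]]            # a bunch of vectors
tensor = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]          # a bunch of matrices

print(type(tensor))  # <class 'list'> here; a torch.Tensor reports its own type
```

In PyTorch, the analogous check is `t.dim()` (or `t.ndim`) on a `torch.Tensor`, which would give 0, 1, 2, and 3 for these four objects.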
Machine learning algorithms explained
Machine learning and deep learning have been widely embraced, and even more widely misunderstood. In this article, I'd like to step back and explain both machine learning and deep learning in basic terms, discuss some of the most common machine learning algorithms, and explain how those algorithms relate to the other pieces of the puzzle of creating predictive models from historical data. Recall that machine learning is a class of methods for automatically creating models from data. Machine learning algorithms are the engines of machine learning, meaning it is the algorithms that turn a data set into a model. Which kind of algorithm works best (supervised, unsupervised, classification, regression, etc.) depends on the kind of problem you're solving, the computing resources available, and the nature of the data.